Spark 46959#54812

Open
azmatsiddique wants to merge 3 commits into apache:master from azmatsiddique:SPARK-46959

@azmatsiddique

What changes were proposed in this pull request?

This PR fixes an issue where the CSV reader inconsistently parses empty quoted strings ("") when the escape option is set to an empty string ("").

Previously, if escape="" was used, mid-line empty quoted strings were correctly resolved to an empty string, but an empty quoted string in the last column resolved to a literal " character. This occurred because Spark maps escape="" to \u0000 and passes that character to univocity’s parser as the escape character. At the end of a line, where no delimiter follows, univocity misinterprets the second " as an escaped quote rather than a closing quote.
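The option mapping mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not Spark's Scala code: the helper name `resolve_escape_char`, its error message, and the backslash default are assumptions made for the example.

```python
NUL = "\u0000"


def resolve_escape_char(option_value, default="\\"):
    """Hypothetical helper: reduce the user-supplied escape option to one character."""
    if option_value is None:
        # Option not set: fall back to the default escape character.
        return default
    if len(option_value) == 0:
        # escape="" is represented as the NUL character, as described in this PR.
        return NUL
    if len(option_value) == 1:
        return option_value
    raise ValueError("escape must be at most one character")
```

With this sketch, `resolve_escape_char("")` yields `"\u0000"`, which is the value the parser then receives as its escape character.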

The fix introduces a post-processing step in UnivocityParser.parseLine to detect this specific condition (a single quote character as the last token when escape is \u0000) and replace it with the configured emptyValueInRead.
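The post-processing condition can be sketched roughly as below. This is a simplified model in plain Python, not the actual change to `UnivocityParser.parseLine`: the function name `fix_trailing_empty_quoted`, its parameters, and the empty-string default for `empty_value_in_read` are hypothetical and exist only to illustrate the check.

```python
NUL = "\u0000"


def fix_trailing_empty_quoted(tokens, quote='"', escape=NUL, empty_value_in_read=""):
    """Hypothetical stand-in for the post-processing step described in this PR.

    If the escape character is NUL and the last token is a lone quote
    character, replace that token with the configured empty value.
    """
    if escape != NUL or not tokens:
        # The misparse only arises when escape="" was mapped to NUL.
        return tokens
    if tokens[-1] == quote:
        # Return a new list rather than mutating the parser's output in place.
        return tokens[:-1] + [empty_value_in_read]
    return tokens
```

For example, `fix_trailing_empty_quoted(["a", "", '"'])` returns `["a", "", ""]`, while the same tokens with `escape="\\"` are returned unchanged. Only the last token is inspected, so a lone quote in a mid-line column is left alone.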

Why are the changes needed?

To ensure consistent parsing of CSV data regardless of whether an empty quoted string appears in the middle of a line or at the end of a line.

Does this PR introduce any user-facing change?

Yes, it fixes a bug where users were receiving incorrect data (a literal quote instead of an empty/null value) for the last column in a row under specific CSV configurations (e.g. escape="", quote="\"", sep=";").

How was this patch tested?

Added a new regression test in CSVSuite.scala:
"SPARK-46959: CSV reader reads data inconsistently depending on column position"

The test verifies that an empty quoted string behaves identically in the mid-line position (column c) and the end-of-line position (column d) when configured with escape="" and nullValue="".

Verified that both CSVv1Suite and CSVv2Suite pass without regressions.

Was this patch authored or co-authored using generative AI tooling?

No

…n last CSV column

### What changes were proposed in this pull request?
This PR fixes an issue where the CSV reader inconsistently parses empty quoted strings (`""`) when the `escape` option is set to an empty string (`""`). Previously, mid-line empty quoted strings correctly resolved to null/empty, but the last column resolved to a literal `"` character due to univocity parser behavior.

### Why are the changes needed?
To ensure consistent parsing of CSV data regardless of column position.

### Does this PR introduce _any_ user-facing change?
Yes, it fixes a bug where users were receiving incorrect data (a literal quote instead of an empty/null value) for the last column in a row under specific CSV configurations.

### How was this patch tested?
Added a new regression test in `CSVSuite` that verifies consistent parsing of both mid-line and end-of-line empty quoted fields.